Load the data

The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), ‘patterned’ precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.

Schematic depicting the iPSC differentiation strategy for motor neurogenesis. Arrows indicate sampling time-points in days when cells were fractionated into nuclear and cytoplasmic compartments prior to deep (polyA) RNA-sequencing. Four iPSC clones were obtained from four different healthy controls and three iPSC clones from two ALS patients with VCP mutations: R155C and R191Q; hereafter termed VCPmu. Induced-pluripotent stem cells (iPSC); neural precursors (NPC); “patterned” precursor motor neurons (ventral spinal cord; pMN); post-mitotic but electrophysiologically inactive motor neurons (MN); electrophysiologically active MNs (mMN). The gene expression count data was obtained from Kallisto (Bray et al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.

# Loading the data
load("./data_09_04_2024.RData")

# Composition of the data: 
# myE_ge                  : raw gene expression count matrix 
# info                    : sample annotation (data-frame)
# ttg                     : rows (genes) annotation

# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge      <- myE_ge[,sel_samples]
info        <- info[sel_samples,]
info$group  <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))


# Make some nice colors to facilitate the visualisation of time-points
mytime                 <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days            <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days)     <- c(0,3,7,14,22,35)
mycols                 <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))

Get familiar with your data

How many samples is there in your count data?

Does this correspond to your experimental protocol?

Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file.

What are the covariates?

how many rows?

Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.

#View(info)
#nrow(info)
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))
## [1] 0

How does your data look-like?

par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")

# t() computes the transpose
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")

Pre-processing of the count data

Variance stabilisation with log-transformation

Analysis of the variance of the gene count across the \(N\) samples: \(\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\) and \(\mu=\frac{1}{n}\sum_{i=1}^n\)x_i:

Let’s have a close look at how the variance in gene expression scale with average and then correct it with some transformation.

#calculate mean and variance of the rows
row_avg <- apply(myE_ge,1,mean)
row_var <- apply(myE_ge,1,var)

#Log-transform your count data
myE_gel <- log2(1+myE_ge)

#DESeq2 variance stabilisation
vsd    <- DESeq2::varianceStabilizingTransformation(matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE))

par(mfrow=c(1,3))

#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()

plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()

plot(apply(vsd,1,mean),apply(vsd,1,var),las=1,main="DESeq2 variance stabilised",pch=19,col=rgb(0,0,0,0.05),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()

Identification of reliably expressed genes

Let’s first have a look at the distribution of gene expression for a few samples:

par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")

We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribtution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.

We can now select for the reliably expressed genes in each sample:

Normalisation

Let’s first look at the ditribution of the gene count across a few samples:

There are several options for normalisation. Scaling factors, quantile normalisation etc.

What is the effect on the gene count?

Unsupervised hierarchical clustering analysis of the samples

Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.

Let’s first compare the clustering of the samples using the Manhattan distance and Ward D algorithm:

Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Let’s compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix:

Differential gene expression analysis - Graded homework

Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.

Task 1 (1 pt)

In this task, you will perform a differential gene expression using Deseq2.

  1. Question: For each time point, compute the number of differentially expressed genes (up and down). Show these numbers in barplots: you should obtain a barplot with 5 bars (one for each time point) for the upregulated genes, and similarly one barplot for the downregulated genes. Comment on the evolution of these numbers.

Answer: The aim is to conduct a differential gene expression analysis using Deseq2.This tool enables the identification of genes showing differential expression across various experimental conditions or groups. Our objective is to compare each time point to the with the baseline time, DIV 0, determining the count of genes exhibiting significant upregulation and downregulation.

Every passage and my decisions are explained in the comments.

# Store the data, colnames, rownames and info
mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
# In the levels, it is important to put "0" as first, so the analysis uses it as base line to confront the others
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
# Create the levels also for fraction
info$Fraction <- factor(info$Fraction, levels=c("Nuclear", "Cytoplasmic"))

# Design formula is used to estimate dispersion and log2 fold changes
# In info, it is possible to see that there are only two variables giving the variability: DIV and Fraction.
# group is just the "union" of the two.
# Since Fraction is giving variability in the data, but we want only to see the variability due to the timepoints, I use Fraction + DIV and then select just the results relative to DIV
# In this case, DIV is the factor of interest and Fraction is another source of variation in the data, so
# DIV is put as the second variable
design <- as.formula(~ Fraction + DIV)
model.matrix(design, data=info)
##    (Intercept) FractionCytoplasmic DIV3 DIV7 DIV14 DIV22 DIV35
## 5            1                   0    0    0     0     0     0
## 17           1                   0    0    0     0     0     0
## 29           1                   0    0    0     0     0     0
## 41           1                   0    0    0     0     0     0
## 7            1                   0    1    0     0     0     0
## 19           1                   0    1    0     0     0     0
## 31           1                   0    1    0     0     0     0
## 43           1                   0    1    0     0     0     0
## 9            1                   0    0    1     0     0     0
## 21           1                   0    0    1     0     0     0
## 33           1                   0    0    1     0     0     0
## 45           1                   0    0    1     0     0     0
## 11           1                   0    0    0     1     0     0
## 23           1                   0    0    0     1     0     0
## 35           1                   0    0    0     1     0     0
## 47           1                   0    0    0     1     0     0
## 1            1                   0    0    0     0     1     0
## 13           1                   0    0    0     0     1     0
## 25           1                   0    0    0     0     1     0
## 37           1                   0    0    0     0     1     0
## 3            1                   0    0    0     0     0     1
## 15           1                   0    0    0     0     0     1
## 27           1                   0    0    0     0     0     1
## 39           1                   0    0    0     0     0     1
## 4            1                   1    0    0     0     0     0
## 16           1                   1    0    0     0     0     0
## 28           1                   1    0    0     0     0     0
## 40           1                   1    0    0     0     0     0
## 6            1                   1    1    0     0     0     0
## 18           1                   1    1    0     0     0     0
## 30           1                   1    1    0     0     0     0
## 42           1                   1    1    0     0     0     0
## 8            1                   1    0    1     0     0     0
## 20           1                   1    0    1     0     0     0
## 32           1                   1    0    1     0     0     0
## 44           1                   1    0    1     0     0     0
## 10           1                   1    0    0     1     0     0
## 22           1                   1    0    0     1     0     0
## 34           1                   1    0    0     1     0     0
## 46           1                   1    0    0     1     0     0
## 12           1                   1    0    0     0     1     0
## 24           1                   1    0    0     0     1     0
## 36           1                   1    0    0     0     1     0
## 48           1                   1    0    0     0     1     0
## 2            1                   1    0    0     0     0     1
## 14           1                   1    0    0     0     0     1
## 26           1                   1    0    0     0     0     1
## 38           1                   1    0    0     0     0     1
## attr(,"assign")
## [1] 0 1 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Fraction
## [1] "contr.treatment"
## 
## attr(,"contrasts")$DIV
## [1] "contr.treatment"
# Create an object of class DESeqDataSet
# We select only the genes that are expressed global
dds <- DESeq2::DESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
                                      colData = info,
                                      design= design)

# Run the function DEseq2(), which will compute the following:
# 1. Estimating size factors
# 2. Estimating dispersions
# 3. Gene-wise dispersion estimates
# 4. Mean-dispersion relationship
# 5. Final dispersion estimates
# 6. Fitting model and testing
dds <- DESeq2::DESeq(dds)

# Lists the coefficients
mycoeff <- resultsNames(dds)
# Extract the coefficients related to DIV
mycoeff_DIV <- mycoeff[startsWith(mycoeff, "DIV")]

# Function to extract the results 
results_computation <- function(name) {
  res <- results(dds, name = name)
  return (res)
}

# Extract the results for the comparison between time points
DIV_results <- sapply(mycoeff_DIV, results_computation)

# Function to find the number of upregulated genes 
genes_upregulated_computation <- function(res) {
    # To select the upregulated genes, we consider a log2FC higher than 1, and a adjusted p-value lower than 0.01
  # The adjusted p-value is obtained with Benjamini and Hochberg method
  genes_up <- as.character(rownames(res))[res$log2FoldChange>1.0&res$padj<0.01]
  return (genes_up)
}

# Function to find the number of downregulated genes 
genes_downregulated_computation <- function(res) {
  # To select the downregulated genes, we consider a log2FC lower than -1, and a adjusted p-value lower than 0.01
  # The adjusted p-value is obtained with Benjamini and Hochberg method
  genes_do <- as.character(rownames(res))[res$log2FoldChange<(-1.0)&res$padj<0.01]
  return (genes_do)
}

# Find the upregulated and downregulated genes
genes_upregulated <- sapply(DIV_results, genes_upregulated_computation)
genes_downregulated <- sapply(DIV_results, genes_downregulated_computation)

# Find the number of upregulated and downregulated genes
genes_upregulated_number <- sapply(genes_upregulated, length)
genes_downregulated_number <- sapply(genes_downregulated, length)
# Create a layout with 1 row and 2 columns
par(mfrow=c(1, 2))

# Bar plot of upregulated genes
up_bar <- barplot(genes_upregulated_number, 
                   names.arg=mycoeff_DIV, 
                   xlab="Comparison between time points", 
                   ylab="Number of genes", 
                   main="Number of upregulated genes", 
                   col=mycols_days[2:length(mycols_days)], 
                   cex.names=0.7, 
                   ylim=c(0, 4800))  

text(x = up_bar, y = genes_upregulated_number + 0.5, label = genes_upregulated_number, pos = 3)

# Bar plot of downregulated genes
down_bar <- barplot(genes_downregulated_number, 
                     names.arg=mycoeff_DIV, 
                     xlab="Comparison between time points", 
                     ylab="Number of genes", 
                     main="Number of downregulated genes", 
                     col=mycols_days[2:length(mycols_days)], 
                     cex.names=0.7, 
                     ylim=c(0, 4800)) 

text(x = down_bar, y = genes_downregulated_number + 0.5, label = genes_downregulated_number, pos = 3)

Comment: The number of differential expressed genes increases progressively with time, with the biggest step between DIV_14_vs_0 and DIV_22_vs_0. Since we are comparing the baseline (DIV 0) pluripotent stem cells (iPSCs), with cells that are gradually becoming more specialized, the growing number of genes with different expression levels over time is expected. As stem cells differentiate into specific cell types, the expression of genes changes to carry out their new specific functions.

  1. Question: For each time point, display the volcano plot showing the differentially expressed genes.
#Volcano plot
VolcanoPlot <- function(myDGE=res, col_log2FC=res$log2FoldChange, col_pval=res$padj, title){
  mycol_genes <- rep(rgb(0.7,0.7,0.7,0.2))
  mycol_genes[abs(col_log2FC)>=1.0&col_pval<=0.05] <- rep(rgb(0.0,0.0,0.0,0.2))
  plot(col_log2FC,-log10(col_pval),pch=19,cex=0.3,col=mycol_genes,xlab="log2FC",ylab="-log10(P-value)", main=title, frame=FALSE, ylim=c(0, 270))
  abline(h=-log10(0.01),lty=2,col="grey")
  abline(v=c(-1,1),lty=2,col="grey")
}
#Volcano plots
par(mfrow=c(1,5))

invisible(sapply(mycoeff_DIV, function(x){VolcanoPlot(myDGE=DIV_results[[x]], col_log2FC=DIV_results[[x]]$log2FoldChange, col_pval=DIV_results[[x]]$padj, title=x)}))

Comment: This plot shows the number of differentially expressed genes (corresponding to the dots). The upregulated genes have a positive (greater than 1) log2FC, while the dowregulated genes have a negative (smaller than -1) log2FC. This plot also shows how genes are distributed based on two measures: adjusted p-value (calculated using the Benjamini and Hochberg method) and the log2 fold change. We can see that as time progresses (compared to the baseline DIV 0), two things happen: the total number of differentially expressed genes increases (the dots), and there’s also a rise in the number of genes with a very low adjusted p-value (high -log10 p-value).

  1. Question: For the time point D0–>D3, show the biological pathways associated with the upregulated and downregulated genes. Comment.
# Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
  gostres_diff <- gost(query = genes_list, 
                  organism = "hsapiens", ordered_query = FALSE, 
                  multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                  measure_underrepresentation = FALSE, evcodes = TRUE, 
                  user_threshold = 0.05, correction_method = "g_SCS", 
                  domain_scope = "annotated", custom_bg = NULL, 
                  numeric_ns = "", as_short_link = FALSE, sources=c("GO:BP", "GO:MF","KEGG"))
  gostplot(gostres_diff, capped = FALSE, interactive = TRUE) # please note this is going to create an interactive plot
}
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_upregulated_DIV3 = genes_upregulated[[1]]
GO_analysis(genes_upregulated_DIV3)
# Find the biological pathways associated with the downregulated genes in DIV 3 compared to DIV 0
genes_downregulated_DIV3 = genes_downregulated[[1]]
GO_analysis(genes_downregulated_DIV3)

Comment: Considering GO:BP (Gene Ontology - Biological Process), the upregulated genes seem to be linked to developmental processes (system development, anatomical structure development, developmental processes, nervous system development), while the downregulated genes seem to be linked to signaling processes (cell communication, signaling, signal transduction). Considering KEGG, we find the Wnt signaling pathway in the upregulated, and neuroactive ligand-receptor interaction in the dowregulated, as the ones with more statistical significance. The upregulation of developmental processes drives cellular differentiation (from hiPSCs at DIV 0 to NPCs at DIV 3). NPCs dowregulated general communication and signaling pathways, but upregulate more specific signaling pathways, in this case wingless (WNT) signaling pathway. By analyzing the enriched pathways, we can gain a deeper understanding of genes and processes involved in cellular differentiation of hiPSCs (human induced pluripotent stem cells, DIV 0) cinto NPCs (neural precursor cells, DIV 3).

Task 2 - Bonus (0.25 pt)

Do the same as task 1 but using EdgeR – Likelihood ratio tests. Comment on the differences between the techniques.

Answer: As before, please see the comments to understand the passages and my choices. Two different tests are techniques: edgeR with likelihood ratio tests and edgeR with quasi-likelihood F-tests.

EdgeR - Likelihood ratio tests

# Store the data, colnames, rownames and info
mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
# In the levels, it is important to put "0" as first, so the analysis uses it as base line to confront the others
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
info$Fraction <- factor(info$Fraction, levels=c("Nuclear", "Cytoplasmic"))
# I used the same design as before, in order to be able to compare the different methods
design <- as.formula(~ Fraction + DIV)
# We select only the genes that are expressed global
y <- DGEList(counts=mycountData[is_expressed_global,], group=info$group)
design <- model.matrix(design, data=info)
y <- estimateDisp(y,design)

# To perform likelihood ratio tests:
# Fit the model
fit <- glmFit(y,design=design)
# Get the coefficients
mycoefs <- colnames(fit$design)
# list of coefficients we are interested in
coefficients <- c(3, 4, 5, 6, 7)

# Function to find the number of upregulated genes 
genes_upregulated_computation_edgeR <- function(coeff) {
  lrt.nuc <- glmLRT(fit, coef=coeff)
  res <- topTags(lrt.nuc, n=sum(is_expressed_global))$table
  genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
  return (genes_up)
}

# Function to find the number of downregulated genes 
genes_downregulated_computation_edgeR <- function(coeff) {
  lrt.nuc <- glmLRT(fit, coef=coeff)
  res <- topTags(lrt.nuc, n=sum(is_expressed_global))$table
  genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]
  return (genes_do)
}

# Find the upregulated and downregulated genes
genes_upregulated_edgeR <- sapply(coefficients, genes_upregulated_computation_edgeR)
genes_downregulated_edgeR <- sapply(coefficients, genes_downregulated_computation_edgeR)

# Find the number of upregulated and downregulated genes
genes_upregulated_number_edgeR <- sapply(genes_upregulated_edgeR, length)
genes_downregulated_number_edgeR <- sapply(genes_downregulated_edgeR, length)
# Create a layout with 1 row and 2 columns
par(mfrow=c(1, 2))

# Bar plot of upregulated genes
up_bar <- barplot(genes_upregulated_number_edgeR, 
                   names.arg=mycoeff_DIV, 
                   xlab="Comparison between time points", 
                   ylab="Number of genes", 
                   main="Number of upregulated genes", 
                   col=mycols_days[2:length(mycols_days)], 
                   cex.names=0.7, 
                   ylim=c(0, 5000))

text(x = up_bar, y = genes_upregulated_number_edgeR + 0.5, label = genes_upregulated_number_edgeR, pos = 3)

# Bar plot of downregulated genes
down_bar <- barplot(genes_downregulated_number_edgeR, 
                     names.arg=mycoeff_DIV, 
                     xlab="Comparison between time points", 
                     ylab="Number of genes", 
                     main="Number of downregulated genes", 
                     col=mycols_days[2:length(mycols_days)], 
                     cex.names=0.7, 
                     ylim=c(0, 5000))

text(x = down_bar, y = genes_downregulated_number_edgeR + 0.5, label = genes_downregulated_number_edgeR, pos = 3)

Comment: With edgeR, we find similar results to the previous method (DESeq2). Even in this case it is possible to see that the biggest step is between DIV 14 vs 0 and DIV 22 vs 0. With DESeq2, we find more upregulated genes compared to edgeR, while with edgeR we find more dowregulated genes compared to DESeq2.

#Volcano plots
par(mfrow=c(1,5))

invisible(sapply(coefficients, function(coeff) {
  # Perform glmLRT and topTags computations
  lrt.nuc <- glmLRT(fit, coef=coeff)
  res <- topTags(lrt.nuc, n=sum(is_expressed_global))$table
  # Generate VolcanoPlot
  VolcanoPlot(myDGE=res, col_log2FC = res$logFC, col_pval = res$FDR, title = mycoefs[coeff])
}))

Comment: As timepoint (compared to the initial one, zero) increases, the number of differential expressed genes increases and also the number of genes with a very low adjusted p-value increases, just as in the previous method. However, in DESeq2 the -log of adjusted p-values were slightly higher than with edgeR, meaning that edgeR is more conservative than DESeq2.

# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_upregulated_DIV3_edgeR = genes_upregulated_edgeR[[1]]
GO_analysis(genes_upregulated_DIV3_edgeR)
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_downregulated_DIV3_edgeR = genes_downregulated_edgeR[[1]]
GO_analysis(genes_downregulated_DIV3_edgeR)

Comment: The results found using edgeR are very similar to the ones using DESeq2 for the GO enrichment: the uperegulated genes are related to developmental processes, while the downregulated genes are linked to signaling processes.

EdgeR - Quasi-likelihood F-tests

# Store the data, colnames, rownames and info
mycountData           <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)

# In the levels, it is important to put "0" as first, so the analysis uses it as base line to confront the others
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
info$Fraction <- factor(info$Fraction, levels=c("Nuclear", "Cytoplasmic"))
# We select only the genes that are expressed global
y  <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
# I used the same design as before, in order to be able to compare the different methods
design <- as.formula(~ info$Fraction + info$DIV)
design <- model.matrix(design)
y <- estimateDisp(y,design)

# To perform quasi-likelihood F-tests:
# Fit the model
fit <- glmQLFit(y,design)
# Get the coefficients
mycoefs <- colnames(fit$design)
# list of coefficients we are interested in
coefficients <- c(3, 4, 5, 6, 7)

# Function to find the number of upregulated genes 
genes_upregulated_computation_edgeR_QL <- function(coeff) {
  qlf.nuc <- glmQLFTest(fit, coef=coeff)
  res<- topTags(qlf.nuc,n=sum(is_expressed_global))$table
  genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
  return (genes_up)
}

# Function to find the number of downregulated genes 
genes_downregulated_computation_edgeR_QL <- function(coeff) {
  qlf.nuc <- glmQLFTest(fit, coef=coeff)
  res<- topTags(qlf.nuc,n=sum(is_expressed_global))$table
  genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]
  return (genes_do)
}


# Find the upregulated and downregulated genes
genes_upregulated_edgeR_QL <- sapply(coefficients, genes_upregulated_computation_edgeR_QL)
genes_downregulated_edgeR_QL <- sapply(coefficients, genes_downregulated_computation_edgeR_QL)

# Find the number of upregulated and downregulated genes
genes_upregulated_number_edgeR_QL <- sapply(genes_upregulated_edgeR_QL, length)
genes_downregulated_number_edgeR_QL <- sapply(genes_downregulated_edgeR_QL, length)
# Create a layout with 1 row and 2 columns
par(mfrow=c(1, 2))

# Bar plot of upregulated genes
up_bar <- barplot(genes_upregulated_number_edgeR_QL, 
                   names.arg=mycoeff_DIV, 
                   xlab="Comparison between time points", 
                   ylab="Number of genes", 
                   main="Number of upregulated genes", 
                   col=mycols_days[2:length(mycols_days)], 
                   cex.names=0.7, 
                   ylim=c(0, 5000))

text(x = up_bar, y = genes_upregulated_number_edgeR_QL + 0.5, label = genes_upregulated_number_edgeR_QL, pos = 3)

# Bar plot of downregulated genes
down_bar <- barplot(genes_downregulated_number_edgeR_QL, 
                     names.arg=mycoeff_DIV, 
                     xlab="Comparison between time points", 
                     ylab="Number of genes", 
                     main="Number of downregulated genes", 
                     col=mycols_days[2:length(mycols_days)], 
                     cex.names=0.7, 
                     ylim=c(0, 5000))

text(x = down_bar, y = genes_downregulated_number_edgeR_QL + 0.5, label = genes_downregulated_number_edgeR_QL, pos = 3)

Comment: Here the results are slightly less than edgeR with likelihood tests, because quasi-likelihood tests are more conservatives. We find the same trends as before.

#Volcano plots
par(mfrow=c(1,5))

invisible(sapply(coefficients, function(coeff) {
  # Perform glmQLFTest and topTags computations
  qlf.nuc <- glmQLFTest(fit, coef=coeff)
  res <- topTags(qlf.nuc, n=sum(is_expressed_global))$table
  # Generate VolcanoPlot
  VolcanoPlot(myDGE=res, col_log2FC = res$logFC, col_pval = res$FDR, title = mycoefs[coeff])
}))

Comment: The values of -10 log of adjusted p-values are quite smaller compared to the previous two methods. In fact, edgeR calculates p-values via the quasi-likelihood method, while DESeq uses Wald tests. For this reason, edgeR with the quasi-likelihood method is more conservative than the previous methods.

# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_upregulated_DIV3_edgeR_QL = genes_upregulated_edgeR_QL[[1]]
GO_analysis(genes_upregulated_DIV3_edgeR_QL)
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_downregulated_DIV3_edgeR_QL = genes_downregulated_edgeR_QL[[1]]
GO_analysis(genes_downregulated_DIV3_edgeR_QL)

Comment: Also in this case, the pathways found looking at the upregulated genes are connected to the development, while the ones found looking at the downregulated genes are connected to cell communication and signaling pathways.